Automated Behavioral Coding to Enhance the Effectiveness of Motivational Interviewing in a Chat-Based Suicide Prevention Helpline: Secondary Analysis of a Clinical Trial

Background. With the rise of computer science and artificial intelligence, analyzing large data sets promises enormous potential in gaining insights for developing and improving evidence-based health interventions. One such intervention is the counseling strategy motivational interviewing (MI), which has been found effective in improving a wide range of health-related behaviors. Despite the simplicity of its principles, MI can be a challenging skill to learn and requires expertise to apply effectively.

Objective. This study aims to investigate the performance of artificial intelligence models in classifying MI behavior and explore the feasibility of using these models in online helplines for mental health as an automated support tool for counselors in clinical practice.

Methods. We used a coded data set of 253 MI counseling chat sessions from the 113 Suicide Prevention helpline. With 23,982 messages coded with the MI Sequential Code for Observing Process Exchanges codebook, we trained and evaluated 4 machine learning models and 1 deep learning model to classify client and counselor MI behavior based on language use.

Results. The deep learning model BERTje outperformed all machine learning models, accurately predicting counselor behavior (accuracy=0.72, area under the curve [AUC]=0.95, Cohen κ=0.69). It differentiated MI congruent and incongruent counselor behavior (AUC=0.92, κ=0.65) and evocative and nonevocative language (AUC=0.92, κ=0.66). For client behavior, the model achieved an accuracy of 0.70 (AUC=0.89, κ=0.55). The model's interpretable predictions discerned client change talk and sustain talk, counselor affirmations, and reflection types, facilitating valuable counselor feedback.

Conclusions. The results of this study demonstrate that artificial intelligence techniques can accurately classify MI behavior, indicating their potential as a valuable tool for enhancing MI proficiency in online helplines for mental health. Provided that the data set size is sufficiently large with enough training samples for each behavioral code, these methods can be trained and applied to other domains and languages, offering a scalable and cost-effective way to evaluate MI adherence, accelerate behavioral coding, and provide therapists with personalized, quick, and objective feedback.

Document Description. This supplementary material belongs to the article "Automated Behavioral Coding to Enhance the Effectiveness of Motivational Interviewing in a Chat-Based Suicide Prevention Helpline: Secondary Analysis of a Clinical Trial." We give readers detailed insights into our methods and findings and describe them clearly and transparently, contributing to open science.

Related Work
Table S1 Schematic overview of related work that investigated automated coding of MI transcripts in counseling sessions using machine learning techniques.

Study | Application

Feature Categories
Table S2 Overview of all feature categories, descriptions and corresponding feature sets.

Feature set (# features) | Feature category description
1 Bag of Words (2,000) | Word occurrences in a chat message.
2 TF-IDF (2,000) | Relative importance of word occurrences across all chat messages.
3 Textual features (27) | Capturing a variety of textual information such as message length and the number of question marks.
4 Word embeddings (300) | Representing words as vectors of numbers in high-dimensional space to capture their semantic and contextual meaning.

Confusion Matrix. A confusion matrix is a specific N × N table layout (where N is the number of classes) that allows visualization of the performance of an algorithm. Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class. An example of a confusion matrix is shown in Figure S1. A confusion matrix allows for the computation of different evaluation metrics, such as accuracy, precision, and recall.
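As an illustration of the confusion matrix layout described above (rows are actual classes, columns are predicted classes), the following minimal sketch builds the matrix with the Python standard library only. The class names and label sequences are hypothetical toy values and are not taken from the study data; the helper name `confusion_matrix` is likewise only illustrative.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return an N x N list of lists: rows = actual class, columns = predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Hypothetical toy labels (assumption: not the codes or data used in the study).
labels = ["question", "reflection", "support"]
actual = ["question", "question", "reflection", "reflection", "support", "support"]
predicted = ["question", "reflection", "reflection", "reflection", "support", "question"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>10}: {row}")

# Accuracy is the sum of the diagonal (correct predictions) over all instances.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
print(f"accuracy = {accuracy:.3f}")
```

The diagonal holds the correctly classified instances, so every off-diagonal cell shows which actual class was confused with which predicted class.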

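Figure S1 Example confusion matrix (axes: actual value and predicted outcome; cells include True Negatives, TN).

Accuracy. The accuracy of a machine learning classifier is the fraction of correct predictions (Equation 1), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$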
Precision. Equation 2 shows the formula for computing the precision of a classifier, where TP is the number of true positives and FP the number of false positives. Precision is intuitively the ability of a classifier not to label a negative instance as positive. The best value is 1, and the lowest value is 0.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall. Equation 3 shows the formula for computing the recall of a classifier, which is the classifier's ability to find all positive samples (FN is the number of false negatives). A value of 1 is the best, while 0 is the lowest.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

F1 Score. The F1 score (Equation 4) is the harmonic mean of precision and recall. It ranges from 0 to 1, with 1 being the best value and 0 being the worst.

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

The F1 score is a better evaluation metric than accuracy for classifiers with unbalanced class distributions because it accounts for both false positives and false negatives and seeks a balance between precision and recall. Considering a multiclass classification problem, one could compute the micro- and macro-average F1. The macro-average calculates the metric for each class independently and then takes the mean, giving equal weight to all label classes. A micro-average aggregates the contributions of all classes to compute the average metric, taking class imbalance into account. Another possibility is to treat classification as a multi-label classification problem, where the classifier returns a probability distribution over all classes for each instance. In this case, the sample-average F1 could be computed by calculating the F1 score for each sample and returning the average.
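To make the difference between the macro- and micro-average F1 concrete, the sketch below computes both from per-class counts of true positives, false positives, and false negatives. It is a minimal illustration using only the Python standard library; the labels `A`, `B`, `C`, the toy predictions, and the helper names are assumptions for the example and do not come from the study.

```python
def per_class_counts(actual, predicted, labels):
    """Count (TP, FP, FN) per class, treating that class as the positive class."""
    counts = {}
    for label in labels:
        pairs = list(zip(actual, predicted))
        tp = sum(a == label and p == label for a, p in pairs)
        fp = sum(a != label and p == label for a, p in pairs)
        fn = sum(a == label and p != label for a, p in pairs)
        counts[label] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical, imbalanced toy data: class A is more frequent than B and C.
labels = ["A", "B", "C"]
actual = ["A", "A", "A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "A", "B", "B", "C", "C", "A"]

counts = per_class_counts(actual, predicted, labels)

# Macro-average: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*counts[label]) for label in labels) / len(labels)

# Micro-average: pool TP, FP, and FN over all classes, then compute F1 once.
pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
micro_f1 = f1_from_counts(*pooled)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

In this toy example the frequent class A is predicted better than B and C, so the micro-average (dominated by A) comes out higher than the macro-average, which weights all three classes equally.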
AUC-ROC. When one needs to evaluate or visualize the performance of a multi-class classification problem, the AUC (Area Under the Curve)-ROC (Receiver Operating Characteristics) curve is a convenient tool (Figure S2). It can provide a richer measure of classification performance than scalar measures such as accuracy. The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. The ROC is a probability curve, and the AUC represents the degree or measure of separability: it tells how well the classifier can distinguish between classes. The ROC curve plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis. The higher the AUC, the better the model separates positive from negative instances. An ideal classifier has a ROC curve that reaches a TPR of 100% at an FPR of 0%. For example, an AUC of 0.7 indicates a 70% probability that the classifier ranks a randomly chosen positive instance above a randomly chosen negative one.
Cohen's Kappa. The kappa statistic expresses the level of agreement between two annotators on a classification problem (Cohen, 1960). It is defined as given in Equation 5.

$$\kappa = \frac{p_o - p_e}{1 - p_e} \tag{5}$$

Here, p_o represents the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio), and p_e is the expected agreement when both annotators assign class labels randomly. p_e is estimated using a per-annotator empirical prior over the class labels (Artstein & Poesio, 2008). The kappa statistic is a number between -1 and 1. The maximum value means complete agreement; a value of zero or lower means agreement no better than chance.
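The quantities p_o and p_e can be computed directly from two label sequences. The following minimal sketch uses the standard definition κ = (p_o − p_e) / (1 − p_e) with p_e estimated from each annotator's own label frequencies. The function name `cohen_kappa` and the toy annotations are assumptions for illustration only; they are not the study's annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences.

    Undefined (division by zero) when p_e equals 1, that is, when both
    annotators always use the same single label.
    """
    n = len(labels_a)
    # Observed agreement: fraction of samples with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: per-annotator empirical priors over the labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    all_labels = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in all_labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations of ten messages by two annotators.
annotator_1 = ["change"] * 5 + ["sustain"] * 5
annotator_2 = ["change"] * 4 + ["sustain"] * 5 + ["change"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on 8 of 10 messages (p_o = 0.8), while the expected chance agreement is p_e = 0.5, so kappa is well below the raw agreement rate.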

Machine Learning Classification Performances
Counselor Behavior

Feature Contributions
Table S7 Most influential features and word combinations contributing to the prediction outcomes and language character per class for counselor-and client behavior.a Stopwords: commonly used words in a language (such as "the", "a", "an", "in" in English).

Class
b Short words: words with less than five characters.
In the case of multi-class classification, one can use the One-vs-Rest methodology to plot N AUC-ROC curves, where N is the number of classes. For instance, given three class labels (A, B, and C), one could plot a curve for class A against B and C, another for class B against A and C, and the third for class C against A and B. Moreover, one could compute the micro- and macro-average AUC with the same idea as with the F1 score; the micro-average AUC is the weighted-average AUC score (it takes class imbalance into account), and the macro-average AUC is simply the average of the AUC scores for all classes.
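The One-vs-Rest idea can be sketched in a few lines. The example below computes one AUC per class by treating that class as positive and all others as negative, and then takes the unweighted mean as the macro-average. The AUC itself is computed with its rank interpretation (the fraction of positive-negative pairs in which the positive instance receives the higher score, ties counting one half) rather than by integrating a plotted curve. All names, labels, and scores are hypothetical toy values, not outputs of the models in the study.

```python
def binary_auc(is_positive, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    # Each pair counts 1 for a win of the positive instance and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(actual, class_scores, labels):
    """class_scores[i][k] is the predicted score of sample i for labels[k]."""
    return {
        label: binary_auc(
            [a == label for a in actual],
            [row[k] for row in class_scores],
        )
        for k, label in enumerate(labels)
    }


# Hypothetical toy data: six samples, three classes, one score per class.
labels = ["A", "B", "C"]
actual = ["A", "A", "B", "B", "C", "C"]
class_scores = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.1, 0.4],
]

auc_per_class = one_vs_rest_auc(actual, class_scores, labels)
macro_auc = sum(auc_per_class.values()) / len(labels)
print(auc_per_class, f"macro AUC = {macro_auc:.4f}")
```

A weighted average would instead multiply each class AUC by the share of samples belonging to that class before summing.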

Table S5
Machine learning algorithm performances on different feature subsets for predicting counselor behavior.

Table S7 (continued)

are speaking with . . .; just a moment; I'll be right back to you; close the chat; read back our conversation
Support (Sup) | neutral sentiment | sorry to hear; sad to hear this; I understand your thoughts; that does sound like; I can imagine; good luck; get well soon

Client behavior
Ask (Ask) | # question marks | what do you mean by that; what can I do; what should I; but how can I; what if; do you agree with; what kind of help

Note. The hashtag character (#) means "number of".